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Abstract. The EM algorithm is a special case of a more general al- 
gorithm called the MM algorithm. Specific MM algorithms often have 
nothing to do with missing data. The first M step of an MM algo- 
rithm creates a surrogate function that is optimized in the second M 
step. In minimization, MM stands for majorize-minimize; in maximiza- 
tion, it stands for minorize-maximize. This two-step process always 
drives the objective function in the right direction. Construction of MM 
algorithms relies on recognizing and manipulating inequalities rather 
than calculating conditional expectations. This survey walks the reader 
through the construction of several specific MM algorithms. The po- 
tential of the MM algorithm in solving high-dimensional optimization 
and estimation problems is its most attractive feature. Our applications 
to random graph models, discriminant analysis and image restoration 
showcase this ability. 

Key words and phrases: Iterative majorization, maximum likelihood, 
inequalities, penalization. 



1. INTRODUCTION 

This survey paper tells a tale of two algorithms 
born in the same year. We celebrate the christening 
of the EM algorithm by Dempster, Laird and Rubin 
(1977) for good reasons. The EM algorithm is one 
of the workhorses of computational statistics with 
literally thousands of applications. Its value was al- 
most immediately recognized by the international 
statistics community. The more general MM algo- 
rithm languished in obscurity for years. Although in 
1970 the numerical analysts Ortega and Rheinboldt 
(1970) allude to the MM principle in the context of 
line search methods, the first statistical application 
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occurs in two papers (de Leeuw, 1977; de Leeuw and 
Heiser, 1977) of de Leeuw and Heiser in 1977 on mul- 
tidimensional scaling. One can argue that the un- 
fortunate neglect of the de Leeuw and Heiser papers 
has retarded the growth of computational statistics. 
The purpose of the present paper is to draw atten- 
tion to the MM algorithm and highlight some of its 
interesting applications. 

Neither the EM nor the MM algorithm is a concre- 
te algorithm. They are both principles for creating 
algorithms. The MM principle is based on the no- 
tion of (tangent) majorization. A function g{6 \ 6"') 
is said to majorize a function f{0) provided 

/(n=9(^"m, 

(1) 

f{0)<9{0\91, e^9^. 

In other words, the surface i— )• g^OlO"^) lies above 
the surface f{6) and is tangent to it at the point 
6 = 0"". Here 6'^ represents the current iterate in a 
search of the surface f{0). The function g{6\9''^) mi- 
norizes f{6) if —g{6\0'^) majorizes —f{0). Readers 
should take heed that the term majorization is used 
in a different sense in the theory of convex functions 
(Marshall and Olkin, 1979). 

In the minimization version of the MM algorithm, 
we minimize the surrogate majorizing function 
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5((9|6'") rather than the actual function f{e). If 6'"+i 
denotes the minimum of the surrogate g{6\0'^), then 
one can show that the MM procedure forces f{9) 
downhill. Indeed, the relations 

(2) /(r+i) < 5(r+i|r) < = /(r) 

follow directly from the definition of 9^~^^ and the 
majorization conditions (1). The descent property 
(2) lends the MM algorithm remarkable numerical 
stability. Strictly speaking, it depends only on de- 
creasing the surrogate function not on mini- 
mizing it. This fact has practical consequences when 
the minimum of cannot be found exactly. In 
the maximization version of the MM algorithm, we 
maximize the surrogate minorizing function g{9\9^). 
Thus, the acronym MM does double duty, serving as 
an abbreviation of both pairs "majorize-minimize" 
and "minorize-maximize." The earlier, less memo- 
rable name "iterative majorization" for the MM al- 
gorithm unfortunately suggests that the principle is 
limited to minimization. 

The EM algorithm is actually a special case of the 
MM algorithm. If f{9) is the log- likelihood of the 
observed data, and Q{9\9^) is the function created 
in the E step, then the minorization 

is the key to the EM algorithm. Maximizing Q{9 \ 9'^) 
with respect to 9 drives f{9) uphill. The proof of 
the EM minorization relies on the nonnegativity of 
the Kullback-Leibler divergence of two conditional 
probability densities. The divergence inequality in 
turn depends on Jensen's inequality and the con- 
cavity of the function Inx (Hunter and Lange, 2004; 
Lange, 2004). 

In our opinion, the MM principle is easier to state 
and grasp than the EM principle. It requires neither 
a likelihood model nor a missing data framework. 
In some cases, existing EM algorithms can be de- 
rived more easily by isolating a key majorization 
or minorization. In other cases, it is quicker and 
more transparent to postulate the complete data 
and calculate the conditional expectations required 
by the E step of the EM algorithm. Many problems 
involving the multivariate normal distribution fall 
into this latter category. Finally, EM and MM algo- 
rithms constructed for the same problem can differ. 
Our second example illustrates this point. Which al- 
gorithm is preferred is then a matter of reliability in 
finding the global optimum, ease of implementation, 
speed of convergence and computational complexity. 



This is not the first survey paper on the MM al- 
gorithm and probably will not be the last. The pre- 
vious articles (Becker, Yang and Lange, 1997; de 
Leeuw, 1994; Heiser, 1995; Hunter and Lange, 2004; 
Lange, Hunter and Yang, 2000) state the general 
principle, sketch various methods of majorization 
and present a variety of new and old applications. 
Prior to these survey papers, the MM principle sur- 
faced in robust regression (Huber, 1981), correspon- 
dence analysis (Heiser, 1987), the quadratic lower 
bound principle (Bohning and Lindsay, 1988), al- 
ternating least squares applications (Bijleveld and 
de Leeuw, 1991; Kiers, 2002; Kiers and Ten Berge, 
1992; Takane, Young and de Leeuw, 1977), med- 
ical imaging (De Pierro, 1995; Lange and Fessler, 
1994) and convex programming (Lange, 1994). Re- 
cent work has demonstrated the utility of MM al- 
gorithms in a broad range of statistical contexts, 
including quantile regression (Hunter and Lange, 
2000), survival analysis (Hunter and Lange, 2002), 
nonnegative matrix factorization (Elden, 2007; Lee 
and Seung, 1999, 2001; Pauca, Piper and Plemmous, 
2006), paired and multiple comparisons (Hunter, 
2004), variable selection (Hunter and Li, 2005), DNA 
sequence analysis (Sabatti and Lange, 2002) and dis- 
criminant analysis (Groenen, Nalbantov and Bioch, 
2006; Lange and Wu, 2008). 

The primary purpose of this paper is to present 
MM algorithms not featured in previous surveys. 
Some of these algorithms are novel, and some are 
minor variations on previous themes. Except for our 
first two examples in Sections 2 and 3, it is unclear 
whether any of the algorithms can be derived from 
a missing data perspective. This fact alone distin- 
guishes them from standard EM fare. In digesting 
the examples, readers should notice how the MM al- 
gorithm interdigitates with other algorithms such as 
block relaxation and Newton's method. Classroom 
expositions of computational statistics leave the im- 
pression that different optimization algorithms act 
in isolation. In reality, some of the best algorithms 
are hybrids. The examples also stress penalized es- 
timation and high-dimensional problems that chal- 
lenge traditional algorithms such as scoring and New- 
ton's method. Such problems are apt to dominate 
computational statistics and data mining for some 
time to come. The MM principle offers a foothold in 
the unforgiving terrain of large data sets and high- 
dimensional models. 

Two theoretical skills are necessary for construct- 
ing new MM algorithms. One is a good knowledge 
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of statistical models. Another is proficiency with in- 
equalities. Most inequalities are manifestations of 
convexity. The single richest source of minorizations 
is the supporting hyperplane inequality 

f{x)>f{y) + df{y){x-y) 

satisfied by a convex function f{x) at each point y 
of its domain. Here df{y) is the row vector of partial 
derivatives of f(x) at y. 

The quadratic lower bound principle of Bohning 
and Lindsay (1988) propels majorization when the 
objective function has bounded curvature. Let cPf{x) 
be the second differential (Hessian) of the objective 
function f{x), and suppose B is a positive definite 
matrix such that B — d?f{x) is positive semidefinite 
for all arguments x. Then we have the majorization 

f{x) = f{y) + dfiy)ix-y) 

+ \{x-yfd^f{z){x-y) 

<f{y) + df{y){x-y) 

+ ^{x-yYB{x-y), 

where z falls on the line segment between x and y. 
Minimization of the quadratic surrogate is straight- 
forward. In the unconstrained case, it involves inver- 
sion of the matrix B, but this can be done once in 
contrast to the repeated matrix inversions of New- 
ton's method. Other relevant majorizations and mi- 
norizations will be mentioned as needed. Readers 
wondering where to start in brushing up on inequal- 
ities are urged to consult the elementary exposition 
(Steele, 2004). The more advanced texts (Boyd and 
Vandenberghe, 2004; Lange, 2004) are also useful for 
statisticians. 

Finally, let us stress that neither EM nor MM is 
a panacea. Optimization is as much art as science. 
There is no universal algorithm of choice, and a good 
deal of experimentation is often required to choose 
among EM, MM, scoring, Newton's method, quasi- 
Newton methods, conjugate gradient, and other more 
exotic algorithms. The simplicity of MM algorithms 
usually argues in their favor. Balanced against this 
advantage is the sad fact that many MM algorithms 
exhibit excruciatingly slow rates of convergence. Sec- 
tion 8 derives the theoretical criterion governing the 
rate of convergence of an MM algorithm. Fortu- 
nately, MM algorithms are readily amenable to ac- 
celeration. For the sake of brevity, we will omit a 
detailed development of acceleration and other im- 
portant topics. Our discussion in Section 9 will take 
these up and point out pertinent references. 



2. ESTIMATION WITH THE 
MULTIVARIATE T 

The multivariate t-distribution has density 
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for all x gRp. Here fi is the mean vector, Q is the 
positive definite scale matrix and > is the de- 
grees of freedom. Let xi, . . . ,Xjn be a random sam- 
ple from f{x). To estimate fi and for fixed, the 
well-known EM algorithm (Lange, Little and Tay- 
lor, 1989; Little and Rubin, 2002) iterates according 
to 

^ m 
^ m 

(4) = — y w'^ixi - /i"+^)(xi - ^JI'+^)\ 



i=l 



where s" = X^^^i w'f is the sum of the case weights 

jn 1^ ..n\trr,n\-lr. 



u + d^l' 



d? 



[Xi 



The derivation of the EM algorithm hinges on the 
representation of the t-density hidden mixture 
of multivariate normal densities. 

Derivation of the same algorithm from the MM 
perspective ignores the missing data and exploits the 
concavity of the function Inx. Thus, the supporting 
hyperplane inequality 



— In X > — In y 
implies the minorization 
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for case i, where c" is a constant that depends on 
neither // nor Q. Summing over the different cases 
produces the overall surrogate. Derivation of the up- 
dates (3) and (4) reduces to standard manipulations 
with the multivariate normal (Lange, 2004). 

Kent, Tyler and Vardi (1994) suggest an alterna- 
tive algorithm that replaces the EM update (4) for 
Q by 



To show that Q = minimizes the surrogate, let 
Ai, . . . , Ap denote the eigenvalues of the positive def- 
inite matrix R~^Q,R~^ . This allows us to express the 
surrogate as a negative multiple of the function 

MA)=.ri^"+ri^"Ev'- 



(5) 



n+l 



<(xi-/i"+i)(x,-/i"+^) 



n+l\t 



Megan and van Dyk (1997) justify this modest amend- 
ment by expanding the parameter space to include a 
working parameter that is tweaked to produce faster 
convergence. It is interesting that a trivial variation 
of our minorization produces the Kent, Tyler and 
Vardi (1994). We simply combine the two log terms 
and minorize via 

In |0| - ln[u + {xi - fiYn-\xi - fi)] 



u +p 



in{\nni^ + {xi-nyn-'{x,-fi)]} 



> 



i_^{|J7|<^[^+(3,,-;,)*f]-i(x,-/z)]} 



, 

with working parameter a = l/i^v +p). 

For readers wanting the full story, we now indicate 
briefly how the second step of the MM algorithm is 
derived. This revolves around maximizing the sur- 
rogate function 

m 

- ^<{|or[i/ + {xi - fiyn~\xi - n)]} 

i=l 

with respect to fi and Q. Regardless of the value 
of $7, one should choose /j, as the weighted mean (3). 
If we let R be the square root of 0"+^ as defined by 
(5) and substitute in the surrogate, then the 

refined surrogate function can be expressed 

-s''{\n\''[u + tT{Q~^R^)]} 

= -s''{\R~^nR~Y[iy + tr{Rn~^R)]}\R\'^''. 



The choice A = 1 corresponds to Q = R^ and yields 
the value h{l) = v The identity VL = can now 
be proved by showing that -|- p is a lower bound 
for /i(A). Setting \j = ^\ a simple rearrangement 
of the bounding inequality shows that it suffices to 
prove the alternative inequality 



-i/('^+p)E^=iej < 



1 ^ 



which is a direct consequence of the convexity of . 

3. GROUPED EXPONENTIAL DATA 

The EM algorithm for estimating the intensity of 
grouped exponential data is well known (Dempster, 
Laird and Rubin, 1977; McLachlan and Krishnan, 
1997; Meilijson, 1989). In this setting the complete 
data corresponds to a random sample x\^ . . . ^Xm 
from an exponential density with intensity A. The 
observed data conforms to a sequence of thresholds 
t\ < < • ■ • < tm- It is convenient to append the 
threshold to = to this list and to let q record 
the number of values that fall within the interval 
(ti,ij+i]. The exceptional count Cm represents the 
number of right-censored values. One can derive a 
novel MM algorithm by close examination of the 
log-likelihood 

L(A) =coln(l -e-^*i) 

m— 1 

■^^Ci ln(e-^*' - e-^*'+0 - c^At^ 



i=l 
m— 1 



m— 1 



-A ^ Qti+i - c^At^ ^ Ci ln(e^'^' - 1), 



i=0 



i=0 



where di = tj+i — tj. 

The above partial linearization of the log-likelihood 
-L(A) focuses our attention on the remaining nonlin- 
ear parts of L{X) determined by the function /(A) = 
ln(e^'^ — 1). The derivatives 



/'(A) 



/"(A) 



e^'^d^ 
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Table 1 







Comparison of MM and EM on groupe 


d exponential data 








MM algorithm 




EM algorithm 


71 


A" 




A" 







1.00000 


-3.00991 


1.00000 


-3.00991 


1 


0.50000 


-1.75014 


0.27082 


-1.34637 


2 


0.25000 


-1.32698 


0.21113 


-1.30591 


3 


0.18924 


-1.30528 


0.20102 


-1.30443 


4 


0.19762 


-1.30438 


0.19904 


-1.30437 


5 


0.19848 


-1.30437 


0.19864 


-1.30437 


6 


0.19853 


-1.30437 


0.19856 


-1.30437 


7 


0.19854 


-1.30437 


0.19854 


-1.30437 



indicate that /(A) is increasing and concave. It is 
impossible to minorize /(A) by a linear function, 
so we turn to the quadratic lower bound principle. 
Hence, in the second-order Taylor expansion 

/(A) = /(A") + /'(A")(A-A") 

+ ir(A.)(A-A")2, 

with fi between A and A", we seek to bound 
from below. One can easily check that f"{^j) is increa- 
sing on (0, oo) and tends to — oo as /.f approaches 0. 
To avoid this troublesome limit, we restrict A to the 
interval (^A'^,oo) and substitute f'WX^) for /"(/i). 
Minorizing the nonlinear part of L[X) term by term 
now gives a quadratic minorizer q{\) of L{\). Be- 
cause the coefficient of A^ in q{X) is negative, the re- 
stricted maximum A""'''^ of q{\) occurs at the bound- 
ary ^A" whenever the unrestricted maximum occurs 
to the left of ^A". In symbols, the MM update re- 
duces to 



\n+l 



max! -A", 
12 



A" + 



Em— I n 
i=0 Ci< 



where 



di 



11 



(^gA"di/2 

Table 1 compares the MM algorithm and the tra- 
ditional EM algorithm on the toy example of Meil- 
ijson (1989). Here we have m = 3 thresholds at 1, 
3 and 10 and assign proportions 0.185, 0.266, 0.410 
and 0.139 to the four ordinal groups. It is clear that 
the MM algorithm hits its lower bound on iterations 
1 and 2. Although its local rate of convergence ap- 
pears slightly better than that of the EM algorithm. 



the differences are minor. The purpose of this exer- 
cise is more to illustrate the quadratic lower bound 
principle in deriving MM algorithms. 

4. POWER SERIES DISTRIBUTIONS 

A family of discrete density functions Pk{0) de- 
fined on {0, 1, . . .} and indexed by a parameter > 
is said to be a power series family provided for all k 

q{9) 

where > and q{6) = YlT=o'^l^^'^ appropri- 
ate normalizing constant (Rao, 1973). The binomial, 
negative binomial, Poisson and logarithmic families 
are examples. Zero truncated versions of these fam- 
ilies also qualify. Fisher scoring is the traditional 
approach to maximum likelihood estimation with a 
power series family. If random sample 

from the discrete density (6), then the log-likelihood 



m 



> XihiO — m\D.q{9) 



i=l 



has score s{9) and expected information J{6) 



i=l 



02 



where o''^{9) is the variance of a single realization. 

Functional iteration provides an alternative to scor- 
ing. It is clear that the maximum likelihood estimate 
is a root of the equation 

- 0q'{9) 



(7) 



q{9) ' 



where x is the sample mean. This result suggests the 
iteration scheme 

xq{9^- 



(8) 



M(0") 
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Table 2 

Performance of the algorithm (8) for truncated Poisson data 



n 


0" 


i(0") 


n 


0" 


i(0") 





1.00000 


-5.41325 


7 


1.59161 


-4.34467 


1 


1.26424 


-4.63379 


8 


1.59280 


-4.34466 


2 


1.43509 


-4.40703 


9 


1.59329 


-4.34466 


3 


1.52381 


-4.35635 


10 


1.59349 


-4.34466 


4 


1.56424 


-4.34670 


11 


1.59357 


-4.34466 


5 


1.58151 


-4.34501 


12 


1.59360 


-4.34466 


6 


1.58867 


-4.34472 


13 


1.59362 


-4.34466 



and raises two obvious questions. First, is the algo- 
rithm (8) an MM algorithm? Second, is it likely to 
converge to 9 even in the absence of such a guar- 
antee? Local convergence hinges on the derivative 
condition |Af'(0)| < 1. When this condition holds, 
the map ^""'"^ = M{9^) is locally contractive near 
the fixed point 6. It turns out that 



M\e) = 1 



where 



em 



is the mean of a single realization X. Thus, conver- 
gence depends on the ratio of the variance to the 
mean. To prove these assertions it is helpful to dif- 
ferentiate q{0). The first derivative delivers the mean 
and the second derivative the second factorial mo- 
ment 



E[X(X-1)] 



32 W 



If one substitutes these into the obvious expression 
for M'{9) and invokes equality (7) at 6, then the 
moment form of M'{6) emerges. 

To address the question of whether functional iter- 
ation is an MM algorithm, we make the assumption 
that q{0) is log-concave. This condition holds for 
the binomial and Poisson distributions but not for 
the negative binomial and logarithmic distributions. 
The convexity of —lnq{9) entails the minorization, 

m 

L{9) > Y,Xiln9 - mlng(r) - m[\nq{9'')]'{9 - ^ 

i=l 

= Y,Xiln9 - mlng(r) - m^^{9 - r). 
i=i ' 



Setting the derivative of this surrogate function equal 
to leads to the MM update (8). One can demon- 
strate that log-concavity implies cr^(0) < The lo- 
cal contraction condition |M'(^)| < 1 is consistent 
with the looser criterion cr^(^) < 2fi{9). Thus, there 
is room for a viable local algorithm that fails to have 
the ascent property. 

The truncated Poisson density has normalizing 
function q{9) = — 1. The second derivative test 
shows that q{9) is log-concave. Table 2 records the 
well-behaved MM iterates (8) for the choices x = 2 
and m = 10. The geometric density counting failures 
until a success has normalizing function q(9) = (1 — 
9)~^, which is log-convex rather than log-concave. 
The iteration function is now M{9) = x{l — 9). Since 
M'(9) = —X, the algorithm diverges for x > 1. Fi- 
nally, the discrete logarithmic density has normal- 
izing constant q{9) = — ln(l — 9), which is also log- 
convex rather than log-concave. The choices x = 2 
and m = 10 lead to the iterates in Table 3. Although 
the algorithm (8) converges for the logarithmic den- 
sity, it cannot be an MM algorithm because the log- 
likelihood experiences a decline at its first iteration. 

One of the morals of this example is that many 
natural algorithms only satisfy the descent or as- 
cent property in special circumstances. This is not 
necessarily a disaster, but without such a guaran- 
tee, safeguards must usually be instituted to pre- 
vent iterates from going astray. Proof of the de- 
scent or ascent property almost always starts with 
majorization or minorization. Because so much of 
statistical inference revolves around log-likelihoods, 
log-convexity and log-concavity are possibly more 
important than ordinary convexity and concavity in 
constructing MM algorithms. 

There are a variety of criteria that help in checking 
log-concavity. Besides the obvious second derivative 
test, one should keep in mind the closure properties 
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Table 3 

Performance of the algorithm (8) for logarithmic data 



n 


0" 




n 


0" 


i(0") 





0.99000 


-15.47280 


9 


0.71470 


-8.98294 


1 


0.09210 


-24.32767 


10 


0.71565 


-8.98293 


2 


0.17545 


-18.35307 


11 


0.71517 


-8.98293 


3 


0.31814 


-13.30624 


12 


0.71542 


-8.98293 


4 


0.52221 


-9.96349 


13 


0.71529 


-8.98293 


5 


0.70578 


-8.98560 


14 


0.71535 


-8.98293 


6 


0.71991 


-8.98355 


15 


0.71532 


-8.98293 


7 


0.71291 


-8.98310 


16 


0.71534 


-8.98293 


8 


0.71655 


-8.98297 


17 


0.71533 


-8.98293 



of the collection of log-concave functions on a given 
domain (Bergstrom and Bagnoli, 2005; Boyd and 
Vandenberghe, 2004). For example, the collection is 
closed under the formation of products and posi- 
tive powers. Any positive concave function is log- 
concave. If f{x) > a > for all x, then f{x) — a is 
log-concave. In some cases, integration preserves log- 
concavity. If /(x) is log-concave, then f{y) dy and 
J^f{y)dy are log-concave. When f{x,y) is jointly 
log-concave in (x, y), J /(x, y) dy is log-concave in x. 
As a special case, the convolution of two log-concave 
functions is log-concave. One of the more useful re- 
cent tests for log-concavity pertains to power series 
(Anderson, Vamanamurthy and Vuorinen, 2007). 
Suppose f{x) = X^fclo (^kx'^ has radius of convergence 
r around the origin. If the coefficients are posi- 
tive and the ratio {k + l)ak+i/ak is decreasing in 
k, then /(x) is log-concave on (— r, r). This result 
also holds for finite series /(x) = '^'k=QO'kX^- In nii- 
norization, log-convexity plays the linearizing role 
of log-concavity. The closure properties of the set of 
log-convex functions are equally impressive (Boyd 
and Vandenberghe (2004)). 

5. A RANDOM GRAPH MODEL 

Random graphs provide interesting models of con- 
nectivity in genetics and internet node ranking. Here 
we consider the random graph model of Blitzstein, 
Chatterjee and Diaconis (2008). Their model assigns 
a nonnegative propensity pi to each node i. An edge 
between nodes i and j then forms independently 
with probability PiPj/{l +PiPj)- The most obvious 
statistical question in the model is how to estimate 
the Pi from data. Once this is done, we can rank 
nodes by their estimated propensities. 



If E denotes the edge set of the graph, then the 
log-likelihood can be written as 

Hp)= ^ [Inpi + lnpj] 

(9) 

- ^ ln{l +piPj). 
{id} 

Here {i,j} denotes a generic unordered pair. The 
logarithms ln(l +PiPj) are the bothersome terms in 
the log-likelihood. We will minorize each of these by 
exploiting the convexity of the function — ln(l -|- x). 
Application of the supporting hyperplane inequality 
yields 

-ln{l+p,Pj)>-ln{l+py]) 



and eliminates the logarithm. Note that equality 
holds when pi =p" for all i. This minorization is not 
quite good enough to separate parameters, however. 
Separation can be achieved by invoking the second 
minorizing inequality 

Note again that equality holds when all pi =Pi- 

These considerations imply that up to a constant 
L{p) is minorized by the function 

g{p\p^) = ^ [Inpi + \ii.pj\ 

{ij}6E 

1 1 (Pi 2,Pl 2\ 
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Table 4 

Convergence of the MM random graph algorithm 



n 













0.00100 


0.48240 


0.95613 


-40572252.7109 


1 


0.00000 


0.48281 


0.97251 


—40565250.8333 


2 


0.00000 


0.48220 


0.98274 


-40562587.5350 


3 


0.00000 


0.48151 


0.98950 


-40561497.1411 


4 


0.00000 


0.48093 


0.99408 


-40561038.9534 


5 


0.00000 


0.48050 


0.99720 


-40560843.3998 


10 


0.00000 


0.47963 


1.00299 


-40560695.6515 


15 


0.00000 


0.47950 


1.00387 


-40560693.1245 


20 


0.00000 


0.47948 


1.00400 


-40560693.0770 


25 


0.00000 


0.47948 


1.00403 


-40560693.0761 


30 


0.00000 


0.47948 


1.00403 


-40560693.0761 


35 


0.00000 


0.47948 


1.00403 


-40560693.0764 



The fact that separates parameters aUows us 

to compute p""*"^ by setting the derivative of 
with respect to pi equal to 0. Thus, we must solve 

,Vi 







^ 1 + -p^fl 



Vi 



If di = Yli{i j}eE 1 denotes the degree of node i, then 
the positive square root 



(10) 



n+l 



Pi 



.1/2 



is the pertinent solution. Blitzstein, Chatterjee and 
Diaconis (2008) derive a different and possibly more 
effective algorithm by a contraction mapping argu- 
ment. 

The MM update (10) is not particularly intuitive, 
but it does have the virtue of algebraic simplic- 
ity. When di = 0, it also makes the sensible choice 
p""*"^ = 0. As a check on our derivation, observe that 
a stationary point of the log-likelihood satisfies 

di \ ^ 



0: 



Pi 



Pj 

+ PiPj 



which is just a rearranged version of the update (10) 
with iteration superscripts suppressed. 

The MM algorithm just derived carries with it 
certain guarantees. It is certain to increase the log- 
likelihood at every iteration, and if its maximum 
value is attained at a unique point, then it will also 
converge to that point. It is straightforward to prove 
that the log-likelihood is concave under the repa- 
rameterization pi = e~''\ The requirement of two 
successive minorizations in our derivation gives us 



pause because if minorization is not tight, then con- 
vergence is slow. On the other hand, if the number 
of nodes is large, then competing algorithms such as 
Newton's method entail large matrix inversions and 
are very expensive. 

As a test case for the MM algorithm, we gener- 
ated a random graph on m = 10,000 nodes with a 
propensity pi for node i of {i — ^) /m. To derive ap- 
propriate starting values for the propensities, we es- 
timated a common background propensity q by set- 
ting q'^ /{l + q^) equal to the ratio of observed edges 
to possible edges and solving for q. This background 
propensity was then used to estimate each by set- 
ting Piq/{1+Piq) equal to di/m and solving for pi. 
Table 4 displays the components Pq, P^^2 Pm 
of the parameter vector at iteration n. The log- 
likelihood actually fails the ascent test in the last it- 
eration because its rightmost digits are beyond ma- 
chine precision. Despite this minor flaw, the algo- 
rithm performs impressively on this relatively large 
and decidedly nonsparse problem. As an indication 
of the quality of the final estimate p, the maximum 
error maxj \ {i — \)/m — pi\ was 0.0825 and the aver- 
age absolute error ^ Ylii l(^ ~ \)/''^ ~Pi\ was 0.0104. 

6. DISCRIMINANT ANALYSIS 

Discriminant analysis is another attractive appli- 
cation. In discriminant analysis with two categories, 
each case i is characterized by a feature vector Zi and 
a category membership indicator yi taking the val- 
ues — 1 or 1 . In the machine learning approach to dis- 
criminant analysis (Scholkopf and Smola, 2002; Vap- 
nik, 1995), the hinge loss function [1 — yi{a + z|/3)]+ 
plays a prominent role. Here (ti)+ is shorthand for 
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the convex function max{u, 0}. Just as in ordinary 
regression, we can penalize the overah loss 

n 

g{e) = Y,[^-y,{a + 4!3)\+ 

i=l 

by imposing a lasso or ridge penalty (Hastie, Tib- 
shirani and Friedman, 2001). Note that the linear 
regression function hi{9) = a + zj/3 predicts either 
— 1 or 1. If = 1 and hi{9) overpredicts in the sense 
that hi{9) > 1, then there is no loss. Similarly, if 
Ui = —1 and hi{6) underpredicts in the sense that 
hi{6) < —1, then there is no loss. 

Most strategies for estimating 9 pass to the dual of 
the original minimization problem. A simpler strat- 
egy is to majorize each contribution to the loss by 
a quadratic and minimize the surrogate loss plus 
penalty. A little calculus (Groenen, Nalbantov and 
Bioch, 2006) shows that (u)+ is majorized at ti" 7^ 
by the quadratic 

(11) g(nK) = ^(n + K|)2. 

In fact, this is the best quadratic majorizer (de Leeuw 
and Lange (2009)). To avoid the singularity at 0, we 
recommend replacing ^(m | u") by 



In double precision, a good choice of e is 10^^. If we 
impose a ridge penalty, then the majorization (11) 
leads to a pure MM algorithm exploiting weighted 
least squares. 

If the number of predictors is large, then the ma- 
trix inversions entailed in updating all parameters 
simultaneously become burdensome. Coordinate de- 
scent offers a viable alternative because it updates 
a single parameter at a time. The large number of 
iterations until convergence required by coordinate 
descent is often outweighed by the extreme simplic- 
ity of each parameter update. Quadratic majoriza- 
tion of the hinge losses keeps the updates simple 
and guarantees a reduction in the objective func- 
tion. The decisions to use a lasso or ridge penalty 
and apply pure MM or coordinate descent with ma- 
jorization will be dictated in practical problems by 
considerations of model selection and the number of 
potential predictors. 

In discriminant analysis with more than two cat- 
egories, it is convenient to pass to e-insensitive loss 
and multiple linear regression. Our recently intro- 
duced method of vertex discriminant analysis (VDA) 



(Lange and Wu, 2008) operates in this fashion and 
relies on an MM algorithm. If there are fc -|- 1 cate- 
gories and p predictors, the basic idea is situate the 
class indicators at the vertices of a regular simplex 
in R'"' and minimize the criterion 

1 " 

(12) R(A,b) = -y^\\y,-Azi-b\U 

i=l 

k 

+ A5^||a,-||2, 

i=i 

where yi is the vertex assigned to case a* is the 
j'th row oiakxp matrix A of regression coefficients, 
6 is a A; X 1 column vector of intercepts, and 

(13) = max{||?;|| — £,0} 

is e-insensitive Euclidean distance. Once A and b 
are estimated, we can assign a new case to the clos- 
est vertex, and hence category. One can design a 
quadratic surrogate by application of the Cauchy- 
Schwarz inequality and minimize the surrogate by 
solving k coordinated least squares problems. The 
combination of a parsimonious loss function and an 
efficient MM algorithm make VDA one of the most 
effective discriminant analysis methods tested (Lange 
and Wu, 2008). 

As a comparison of hinge-loss discriminant anal- 
ysis versus VDA, we now consider four typical ex- 
amples from the UCI machine learning repository 
(Asuncion and Newman, 2007). All four examples 
involve just two categories. For each data set. Ta- 
ble 5 lists the numbers of cases, features, and it- 
erations until convergence, as well as the training 
error rates and the computing times in seconds un- 
der both hinge loss and e-insensitive loss. For VDA 
we set £ = 0.9999, just below the recommended cut- 
off of v^(2A; + 2)//c/2 = 1 for /c 1 = 2 categories. 
The cutoff is the largest e avoiding overlap of the 
e-insensitive spheres around each vertex of the reg- 
ular simplex. We chose the value 10~^ for the tun- 
ing parameter A in all four examples. Our previous 
numerical experience shows that VDA is relatively 
insensitive to the choice of A. Inspection of the train- 
ing errors suggests that the two methods have sim- 
ilar accuracy. To our surprise, VDA is considerably 
faster. 

7. IMAGE RESTORATION AND INPAINTING 

The MM algorithm is also employed in image de- 
convolution (Bioucas-Dias, Figueiredo and Oliveira, 
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Table 5 

Empirical examples from UCI machine learning repository 



Data set Hinge loss VDA 



(Cases, features) Iters Error Time Iters Error Time 



Diabetes (768, 8) 44 0.2266 0.063 11 0.2240 0.015 

SPECT (80, 22) 326 0.2000 0.359 7 0.1750 0.000 

Tic-tac-toe (958, 9) 274 0.0167 0.578 26 0.0167 0.062 

Ionosphere (351, 33) 483 0.0513 2.984 42 0.0570 0.266 



2006; Liao et al., 2002). Suppose a photograph is di- 
vided into pixels and yij is the digitized intensity for 
pixel Some of the yij are missing or corrupted. 
Smoothing pixel values can give a visually improved 
image. Correction of pixels subject to minor corrup- 
tion is termed denoising; correction of missing or 
grossly distorted values is termed inpainting. Let S 
be the set of pixels with acceptable values. We can 
restore the photograph by minimizing the criterion 

iVij - f^ijf + Yl llMij -/^fcdlTV, 

(i,j)GS i j {k,i)eN,j 

where Nij denotes the pixels neighboring pixel («, j), 
||x||tv = V x'^ + £ is the total variation norm with 
e > small and A > is a tuning constant. Let /z^- 
be the current iterate. The total variation penalties 
are majorized using 

IIxIItv < IIx'^IItv + [x' - (x")2] 

based on the concavity of the function ^/t + e. These 
maneuvers construct a simple surrogate function ex- 
pressible as a weighted sum of squares. Other rough- 
ness penalties are possible. For instance, the scaled 
sum of squares A Y.j H{k,i)&N,j if^ij " /^fcO^ is plau- 
sible. Unfortunately, this choice tends to deter the 
formation of image edges. The total variation alter- 
native is preferred in practice because it is gentler 
while remaining continuously differentiable. 

If the pixels are defined on a rectangular grid, then 
we can divide them into two blocks in a checker- 
board fashion, with the red checkerboard squares 
falling into one block and the black checkerboard 
squares into the other block. Within a block, the 
least squares problems generated by the surrogate 
function are parameter separated and hence trivial 
to solve. Thus, it makes sense to alternate the up- 
dates of the blocks. Within a block we update jiij 



via 

„+i ^ %j + ^E(fc,ogjv., f^li/lK - f^Mhv 

~ 2 + AE(fc,o.^,,l/ll/^r,-/^^JlTV 
for (i,j) G S" or via 

~ Eik^N.^yiK-f^lihy 

for ^ S. Here each interior pixel has four 
neighbors. If the singularity constant e is too small 
or if the tuning A is too large, then small residu- 
als generate very large weights. When this pitfall is 
avoided, the described algorithm is apt to be supe- 
rior to the fused lasso algorithm of Friedman, Hastie 
and Tibshirani (2007). 

We applied the total variation algorithm to the 
standard image of the model Lenna. Figure 1 shows 
the original 256 x 256 image with pixel values digi- 
tized on a gray scale from to 255. To the right of 
the original image is a version corrupted by Gaus- 
sian noise (mean and standard deviation 10) and 
a scratch on the shoulder. The images are restored 
with A values of 10, 15, 20 and 25 and an e value of 
1. Although we tend to prefer the restoration on the 
right in the second row, this is a matter of judgment. 
Variations in A clearly control the balance between 
image smoothness and loss of detail. 

8. LOCAL CONVERGENCE OF MM 
ALGORITHMS 

Many MM and EM algorithms exhibit a slow rate 
of convergence. How can one predict the speed of 
convergence of an MM algorithm and choose be- 
tween competing algorithms? Consider an MM map 
M(9) for minimizing the objective function f(9) via 
the surrogate function g{9\9^). According to a theo- 
rem of Ortega (1990), the local rate of convergence 
of the sequence 9^^^ = M{9^) is determined by the 
spectral radius p of the differential dM{9^) at the 
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Fig. 1. Restoration of the Lenna photograph. Top row left: the original image; top row right: image corrupted by Gaussian 
noise (mean and standard deviation 10) and a scratch; second row left: restored image with A = 10; second row right: restored 
image with A = 15; third row left: restored image with A = 20; third row right: restored image with A = 25. The same value 
E = 1 IS used throughout. 



minimum point 6°^ of f{9). Well-known calculations 
(Dempster, Laird and Rubin, 1977; Lange (1995a)) 
demonstrate that 



dM{e° 



I-(fg{9^\9'^)-^d^f{e°°). 



Hence, the eigenvalue equation dM{9°°)v = Xv can 
be rewritten as 

d^g{e^\e'^)v - d^f{e°°)v = \d^g{9^\9^)v. 

Taking the inner product of this with v, we can solve 
for A in the form 



A = l 



v'd'f{9°°)v 
v^d'^g{9°°\9°°)v' 



Extension of this line of reasoning shows that the 
spectral radius satisfies 



P 



1 



. v^d^fiO' 
mm — — r — ; 



3oo 



9°°)v 



Thus, the rate of convergence of the MM iterates is 
determined by how well d'^g{9'^\9°°) approximates 
d'^f{9°°). In practice, the surrogate function g{9\9"') 
should hug f{9) is tightly as possible for 9 close 
to 9"". 

Meng and van Dyk (1997) use this Rayleigh quo- 
tient characterization of the spectral radius to prove 
that the Kent et al. multivariate t algorithm is faster 
than the original multivariate t algorithm. In essence, 
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they show that the second differential d?g{6\9) is 
uniformly more positive definite for the alternative 
algorithm, de Leeuw and Lange (2009) make sub- 
stantial progress in designing optimal quadratic sur- 
rogates. For most other MM algorithms, however, 
such theoretical calculations are too hard to carry 
out, and one must rely on numerical experimenta- 
tion to determine the rate of convergence. The un- 
certainties about rates of convergence are reminis- 
cent of the uncertainties surrounding MCMC meth- 
ods. This should not deter us from constructing MM 
algorithms. On large-scale problems, many tradi- 
tional algorithms are simply infeasible. If we can 
construct a MM algorithm, then there is always the 
chance of accelerating it. We take up this topic briefly 
in the discussion. Finally, let us stress that the num- 
ber of iterations until convergence is not the sole de- 
terminant of algorithm speed. Computational com- 
plexity per iteration also comes into play. On this 
basis, a standard MM algorithm for transmission to- 
mography is superior to a plausible but different EM 
algorithm (Lange, 2004). 

9. DISCUSSION 

Perhaps the best evidence of the pervasive influ- 
ence of the EM algorithm is the sheer number of ci- 
tations garnered by the Dempster et al. paper. As of 
April 2008, Google Scholar hsts 11,232 citations. By 
contrast, Google Scholar lists 58 citations for the de 
Leeuw paper and 47 citations for the de Leeuw and 
Heiser paper. If our contention about the relative 
importance of the EM and MM algorithms is true, 
how can one account for this disparity? Several rea- 
sons come to mind. One is the venue of publication. 
The Journal of the Royal Statistical Society, Series 
B, is one of the most widely read journals in statis- 
tics. The de Leeuw and Heiser papers are buried in a 
hard to access conference proceedings. Another rea- 
son is the prestige of the authors. Four of the five au- 
thors of the three papers. Nan Laird, Donald Rubin, 
Jan de Leeuw and Willem Heiser, were quite junior 
in 1977. On the other hand, Arthur Dempster was a 
major figure in statistics and well established at Har- 
vard, the most famous American university. Besides 
these extrinsic differences, the papers have intrinsic 
differences that account for the better reception of 
the Dempster et al. paper. Its most striking advan- 
tage is the breadth of its subject matter. Dempster 
et al. were able to unify different branches of com- 
putational statistics under the banner of a clearly 



enunciated general principle, de Leeuw and Heiser 
stuck to multidimensional scaling. Their work and 
extensions are well summarized by Borg and Groe- 
nen (1997). 

The EM algorithm immediately appealed to the 
stochastic intuition of statisticians, who are good at 
calculating the conditional expectations required by 
the E step. The MM algorithm relies on inequalities 
and does not play as well to the strengths of statisti- 
cians. Partly for this reason the MM algorithm had 
difficulty breaking out of the vast but placid backwa- 
ter of social science applications where it started. It 
remained sequestered there for years, nurtured by 
several highly productive Dutch statisticians with 
less clout than their American and British colleagues. 

Our emphasis on concrete applications neglects 
some issues of considerable theoretical and practi- 
cal importance. The most prominent of these are 
global convergence analysis, computation of asymp- 
totic standard errors, acceleration, and approximate 
solution of the optimization step (second M) of the 
MM algorithm. Let us address each of these in turn. 

Virtually all of the convergence results announced 
by Dempster et al. (1977) and corrected by Wu (1983) 
and Boyles (1983) carry over to the MM algorithm. 
The known theory, both local and global, is summa- 
rized in the references (Lange, 2004; Vaida, 2005). 
As anticipated, the best results hold in the presence 
of convexity or concavity. The SEM algorithm of 
Meng and Rubin (1991) for computation of asymp- 
totic standard errors also carries over to the MM 
algorithm (Hunter, 2004). Numerical differentiation 
of the score function is a viable competitor, particu- 
larly if the score can be evaluated analytically. The 
simplest form of acceleration is step doubling (de 
Leeuw and Heiser, 1980; Lange and Fessler, 1994). 
This maneuver replaces the point delivered by an 
algorithm map 0"+-^ = M{9^) by the new point 0" + 
2[M(0") —6'^]. Step doubling usually halves the num- 
ber of iterations until convergence in an MM algo- 
rithm. More effective forms of acceleration are pos- 
sible using matrix polynomial extrapolation (Varad- 
han and Roland, 2008) and quasi-Newton and con- 
jugate gradient elaborations of the MM algorithm 
(Jamshidian and Jennrich, 1997; Lange, 1995b). Fi- 
nally, if the optimization step of an MM algorithm 
cannot be accomplished analytically, it is possible to 
fall back on the MM gradient algorithm (Hunter and 
Lange, 2004; Lange, 1995a). Here one substitutes 
one step of Newton's method for full optimization 
of the surrogate function g{9\9'^) with respect to 9. 
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Fortunately, this approximate algorithm has exactly 
the same rate of convergence as the original MM al- 
gorithm. It also preserves the descent or ascent prop- 
erty of the MM algorithm close to the optimal point. 

The reader may be left wondering whether EM or 
MM provides a clearer path to the derivation of new 
algorithms. In the absence of a likelihood function, it 
is difficult for EM to work its magic. Even so, criteria 
such as least squares can involve hidden likelihoods. 
Perhaps the best reply is that we are asking the 
wrong question. After all, one man's mathematical 
meat is often another man's mathematical poison. 
A better question is whether MM broadens the pos- 
sibilities for devising new algorithms. In our view, 
the answer to the second question is a resounding 
yes. Our last four examples illustrate this point. Of 
course, it may be possible to derive one or more of 
these algorithms from the EM perspective, but we 
have not been clever enough to do so. 

In highlighting the more general MM algorithm, 
we intend no disrespect to the pioneers of the EM al- 
gorithm. If the fog of obscurity lifts from the MM al- 
gorithm, it will not detract from their achievements. 
It may, however, propel the ambitious plans for data 
mining underway in the 21st century. Even with the 
expected advances in computer hardware, the statis- 
tics community still needs to concentrate on effective 
algorithms. The MM principle is poised to claim a 
share of the credit in this enterprise. Statisticians 
with a numerical bent are well advised to add it to 
their toolkits. 
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